Contents¶

  • Understand Business Problem

  • Data Overview

  • Data Loading
    • Preprocessing Data
    • Preparing Data
  • EDA
  • Feature engineering
  • Customer segmentation

Understand Business Problem ¶

Creation of a model to predict unit price of a car for retail segment based on the previous contracts data.

Data Overview ¶

Dataset name : "Imaginary Auto Company_Sep 22.xlsx".

  • booking to delivery date : Delivery time post contract(medium, long, short, immediate).
  • unit price($000) : Price per unit
  • type of sale : States whether sale is of type wholesale or retail
  • date of sale : Date when contract was entered in a yyyy-mm-dd format
  • mode of transport : States mode of transport for delivering products(Truck, container, barge)
  • no of units : Units sold per contract
  • cust_group_name : Group that the customer belongs to.
  • cust_segment_name : Customer segment (Star, prime, excellent or more)
  • Customer Id : Unique customer id
  • delivery_month_from : shows when delivery of a contract starts
  • delivery_month_to : shows when delivery of a contract ends
  • type : shows whether product is manual or automatic
  • item_class_l1 : shows approximate price range a product belongs to (basic, premium, mid-range)
  • item_class_l2 : shows brand of product
  • item_class_l3 : classifies product into different buckets within a brand
  • colour : shows colour of product
  • plant_nr : shows plant number at which product is produced
  • plant_city : shows city of production
  • plant_country_name : shows country of production(Phillipines and Vietnam)
  • most expensive part_l1 : shows which part of product is most expensive
  • most expensive part_l2 : classifies product into different buckets per most expensive part l1
  • shipto_city : shows city to which product is shipped
  • shipto_nr : shows customers's shipping id
  • soldto_city : shows city to which product is sold (note ship to city and sold to city can be different)
  • soldto_nr : shows customers's sold to number
  • car_nr : shows product's unique number
  • shipto_country : shows country to which product is shipped
  • soldto_country : shows country to which product is sold
In [1]:
#!pip install missingno
In [3]:
import pandas as pd
import numpy as np
import missingno as msno
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import plotly.express as px
pd.set_option('display.max_rows',30)
pd.set_option('display.max_columns', 30)
pd.set_option('display.width', 1000)
pd.options.mode.chained_assignment = None

import plotly.graph_objects as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
import streamlit as st
import datetime as dt
import warnings
warnings.filterwarnings('ignore')
In [4]:
plt.style.use('fivethirtyeight')
DATA_URL = ".\Imaginary Auto Company_Sep 22.xlsx"

Data loading ¶

In [5]:
def load_data():
    data = pd.read_excel(DATA_URL,sheet_name = "Contracts data",engine = "openpyxl")
    return data
df = load_data()
In [6]:
#df = pd.read_excel(".\Imaginary Auto Company_Sep 22.xlsx",sheet_name = "Contracts data",engine = "openpyxl")
In [7]:
df.head()
Out[7]:
Unnamed: 0 booking to delivery date unit price($000) type of sale date of sale mode of transport no of units cust_group_name cust_segment_name Customer Id delivery_month_from delivery_month_to type item_class_l1 item_class_l2 item_class_l3 colour plant_nr plant_city plant_country_name most expensive part_l1 most expensive part_l2 shipto_city shipto_nr soldto_city soldto_nr car_nr shipto_country soldto_country
0 3 long 1662.54 retail 2159-09-26 Truck 2640.00 aa Prime 953.0-952.0 2160-04-01 2161-10-31 manual Premium Pagani Pagani1 . 1085 Quezon City Philippines bonnet bonner Algeria 953 Algeria 952 Pag99992258 Algeria Algeria
1 4 long 1662.54 retail 2159-09-26 Truck 0.00 aa Prime 953.0-952.0 2160-04-01 2161-10-31 manual Premium Pagani Pagani1 . 1085 Quezon City Philippines bonnet bonner Algeria 953 Algeria 952 Pag99992258 Algeria Algeria
2 5 long 1662.54 retail 2159-09-26 Truck 5720.00 aa Prime 953.0-952.0 2160-04-01 2161-10-31 manual Premium Pagani Pagani1 . 1085 Quezon City Philippines bonnet bonner Algeria 953 Algeria 952 Pag99992258 Algeria Algeria
3 6 long 1896.18 retail 2161-02-26 Truck 2051.28 aa Prime 953.0-952.0 2160-04-01 2161-10-31 manual Premium Pagani Pagani1 . 1085 Quezon City Philippines bonnet bonner Algeria 953 Algeria 952 Pag99992258 Algeria Algeria
4 7 long 1206.26 retail 2160-06-22 Truck 1386.00 aa Prime 953.0-952.0 2160-04-01 2161-10-31 manual Premium Pagani Pagani1 . 1085 Quezon City Philippines bonnet bonner Algeria 953 Algeria 952 Pag99992258 Algeria Algeria

Preprocessing Data ¶

In [8]:
numCols = df.select_dtypes("number").columns
catCols = df.select_dtypes("object").columns
numCols= list(set(numCols))
catCols= list(set(catCols))
print("Numerical columns : ",", ".join(numCols),end = "\n")
print("Categorical columns : ",", ".join(catCols),end = " ")
Numerical columns :  unit price($000), soldto_nr, Unnamed: 0, shipto_nr, no of units, plant_nr
Categorical columns :  mode of transport, type, soldto_city, car_nr, plant_city, shipto_country, item_class_l2, most expensive part_l2, type of sale, item_class_l1, plant_country_name, cust_group_name, colour, soldto_country, most expensive part_l1, booking to delivery date, cust_segment_name, item_class_l3, shipto_city, Customer Id 
In [9]:
df['type of sale'].value_counts()
Out[9]:
wholesale    59352
retail       21141
Name: type of sale, dtype: int64

Preparing Data ¶

Filter based on retail type of sale¶

In [10]:
retail_df = df.loc[df['type of sale'] == 'retail']

Renaming columns to increase readability¶

In [11]:
retail_df.rename(columns={'unit price($000)': 'unit_price_k','no of units':'no_units',
                          'booking to delivery date':'delivery_duration','most expensive part_l2': 'most_expensive_part_l2',
                         'type of sale':'type_of_sale','date of sale':'date_of_sale','mode of transport':'mode_of_transport',
                          'Customer Id':'customer_id','most expensive part_l1': 'most_expensive_part_l1'
                         },inplace = True)
In [12]:
retail_df.drop(['Unnamed: 0'],axis = 1,inplace=True)

Lets look at the different numerical and categorical columns in the dataset.¶

In [13]:
numCols = retail_df.select_dtypes("number").columns
catCols = retail_df.select_dtypes("object").columns
numCols= list(set(numCols))
catCols= list(set(catCols))
print("Numerical columns : ",", ".join(numCols),end = "\n")
print("Categorical columns : ",", ".join(catCols),end = " ")
Numerical columns :  unit_price_k, soldto_nr, shipto_nr, plant_nr, no_units
Categorical columns :  type, soldto_city, car_nr, plant_city, shipto_country, item_class_l2, delivery_duration, item_class_l1, plant_country_name, cust_group_name, colour, most_expensive_part_l2, soldto_country, type_of_sale, cust_segment_name, customer_id, mode_of_transport, item_class_l3, most_expensive_part_l1, shipto_city 
In [14]:
print(f"Dataset has {retail_df.shape[0]} rows and {retail_df.shape[1]} columns")
Dataset has 21141 rows and 28 columns
In [15]:
retail_df.isnull().sum()
Out[15]:
delivery_duration          0
unit_price_k               0
type_of_sale               0
date_of_sale               0
mode_of_transport          0
no_units                   0
cust_group_name            0
cust_segment_name         12
customer_id                0
delivery_month_from        0
delivery_month_to          0
type                       0
item_class_l1              0
item_class_l2              0
item_class_l3              0
colour                     0
plant_nr                   0
plant_city                 1
plant_country_name         1
most_expensive_part_l1    12
most_expensive_part_l2    24
shipto_city               87
shipto_nr                  0
soldto_city                0
soldto_nr                  0
car_nr                     0
shipto_country             0
soldto_country             0
dtype: int64
In [16]:
#retail_df = retail_df.fillna(retail_df.mode().iloc(0))
In [17]:
retail_df.isnull().sum()
Out[17]:
delivery_duration          0
unit_price_k               0
type_of_sale               0
date_of_sale               0
mode_of_transport          0
no_units                   0
cust_group_name            0
cust_segment_name         12
customer_id                0
delivery_month_from        0
delivery_month_to          0
type                       0
item_class_l1              0
item_class_l2              0
item_class_l3              0
colour                     0
plant_nr                   0
plant_city                 1
plant_country_name         1
most_expensive_part_l1    12
most_expensive_part_l2    24
shipto_city               87
shipto_nr                  0
soldto_city                0
soldto_nr                  0
car_nr                     0
shipto_country             0
soldto_country             0
dtype: int64

EDA ¶

Units not sold hence the unit price will account to loss¶

In [18]:
no_units_sold = retail_df[retail_df['no_units'] == 0]
print("Number of units not sold",no_units_sold.shape[0])
print("Costs involved with these unsold units",round(no_units_sold['unit_price_k'].sum(),2),"$")
print("Percentage loss : ",round((no_units_sold['unit_price_k'].sum()/retail_df['unit_price_k'].sum())*100,2),"%")
Number of units not sold 2706
Costs involved with these unsold units 4499978.21 $
Percentage loss :  13.5 %
In [19]:
retail_df.colour.value_counts()
Out[19]:
.              20652
Olive            105
Purple            95
Red               91
Lavender          81
Mauvre            40
Green             36
Yellow            26
Dark Purple        7
Dark Green         5
Ochre              1
White              1
Teal               1
Name: colour, dtype: int64

We see a huge chunk of data with the color as .(dot), assuming the mostly favoured colour as Black, lets replace the values to 'Black' to increase the readability¶

In [20]:
retail_df['colour'] = retail_df['colour'].apply(lambda x:x.replace(".","Black"))
In [21]:
sns.countplot(retail_df.colour)
plt.xticks(rotation = 45);
In [ ]:
 

Bivariate analysis¶

Lets see the boxplots of the features mode_of_transport, type and plant_country_name against unit_prices¶

In [22]:
plt.figure(figsize=(20, 5))
plt.subplot(1,3,1)
sns.boxplot(x = 'mode_of_transport', y = 'unit_price_k', data = retail_df)

plt.subplot(1,3,2)
sns.boxplot(x = 'type', y = 'unit_price_k', data = retail_df)

plt.subplot(1,3,3)
sns.boxplot(x = 'plant_country_name', y = 'unit_price_k', data = retail_df)
Out[22]:
<AxesSubplot:xlabel='plant_country_name', ylabel='unit_price_k'>

Above plots suggest most of the purchases and sales happen by cars transported in a container, and also suggests more automatic cars preferred over manual and seems like Vietnam manufactures profit making cars¶

  • Additionally, there a value which is seen as an outlier and thus repressing all the values in the box plots, lets explore more on that in sometime

Lets see the boxplots of the features mode_of_transport, type and plant_country_name against number of units sold¶

In [23]:
plt.figure(figsize=(20, 5))

plt.subplot(1,3,1)
sns.boxplot(x = 'mode_of_transport', y = 'no_units', data = retail_df)

plt.subplot(1,3,2)
sns.boxplot(x = 'type', y = 'no_units', data = retail_df)

plt.subplot(1,3,3)
sns.boxplot(x = 'plant_country_name', y = 'no_units', data = retail_df)
Out[23]:
<AxesSubplot:xlabel='plant_country_name', ylabel='no_units'>

Univariate analysis¶

In [24]:
fig = plt.figure(figsize = (12,3))
plt.subplot(1,2,1)
sns.histplot(retail_df['no_units'],bins = 40 ,color='r')

plt.subplot(1,2,2)
sns.histplot(retail_df['unit_price_k'],bins = 40,color='b')

plt.subplots_adjust(wspace = 0.8)

Both the continous variable are right skewed.¶

Outliers¶

In [25]:
retail_df.loc[(retail_df['unit_price_k']>50000)]
Out[25]:
delivery_duration unit_price_k type_of_sale date_of_sale mode_of_transport no_units cust_group_name cust_segment_name customer_id delivery_month_from delivery_month_to type item_class_l1 item_class_l2 item_class_l3 colour plant_nr plant_city plant_country_name most_expensive_part_l1 most_expensive_part_l2 shipto_city shipto_nr soldto_city soldto_nr car_nr shipto_country soldto_country
58302 immediate 193971.58 retail 2159-03-15 Truck 0.22 tc Prime 4190.0-4195.0 2159-03-15 2159-03-31 manual Premium Aspark Aspark7 Black 1085 Quezon City Philippines Window regulator glass Kétou 4190 Boussé 4195 Asp99990000 Benin Burkina Faso
There was one record with an extreme high price which was an influence the price distribution, and since the presence of this brand sales which is 'Aspark7' and these transactions are more frequent in wholesales type of sale, lets remove that one record here to proceed further¶
In [26]:
px.box(retail_df['unit_price_k'])
In [27]:
sns.jointplot(x = retail_df['no_units'], y = retail_df['unit_price_k'], hue = retail_df["plant_country_name"])
Out[27]:
<seaborn.axisgrid.JointGrid at 0x17cc5a96250>
In [28]:
retail_df = retail_df.loc[(retail_df['unit_price_k']<50000)]
Below plot shows scatter plot and the individual distribution plots¶
In [29]:
retail_df['unit_price_k'].describe()
Out[29]:
count    21140.000000
mean      1567.710181
std       1728.860511
min     -16287.260000
25%        497.420000
50%       1213.700466
75%       1804.000000
max      33114.840000
Name: unit_price_k, dtype: float64
In the box plot above, there is a contract that costs much higher than the rest of the contracts, since this looks like a point of influence, removing this record.¶
In [30]:
retail_df = retail_df.loc[(retail_df['unit_price_k']<50000)]
In [31]:
px.box(retail_df['unit_price_k'])
In [32]:
sns.displot(retail_df['unit_price_k'], bins = 60)
Out[32]:
<seaborn.axisgrid.FacetGrid at 0x17cc4424340>
In [33]:
sns.displot(retail_df['no_units'], bins = 60)
Out[33]:
<seaborn.axisgrid.FacetGrid at 0x17cc3ff87f0>
In [34]:
sns.relplot(x = retail_df['no_units'], y = retail_df['unit_price_k'],style = retail_df['plant_country_name'],hue = retail_df['mode_of_transport'])
plt.xticks(rotation = 45);
In [35]:
#outl_df = retail_df.loc[(retail_df['unit price($000)']<0)]
In [36]:
print(f"Total unit price for outliers detected $",round(retail_df['unit_price_k'].sum(),2))
print("Number of units detected in the outliers ", retail_df['no_units'].sum())
Total unit price for outliers detected $ 33141393.22
Number of units detected in the outliers  85015197.454
In [37]:
fig,ax = plt.subplots(4,1,figsize = (7,25))
ax1,ax2,ax3,ax4 = ax.flatten()

# cnt = retail_df.groupby(['colour']).sum()['unit_price_k'].sort_values(ascending = False).to_frame()[:20]
# sns.barplot(x = cnt['unit_price_k'],y = cnt.index,ax = ax1 ,palette= 'winter')
# ax1.set_title('Price distribution according to colour of the car')

cnt = retail_df.groupby(['shipto_country']).sum()['unit_price_k'].sort_values(ascending = False).to_frame()[:20]
sns.barplot(x = cnt['unit_price_k'],y = cnt.index,ax = ax1 ,palette= 'inferno')
ax1.set_title('Countries with shipped price distribution')


cnt = retail_df.groupby(['most_expensive_part_l1']).sum()['unit_price_k'].sort_values(ascending = False).to_frame()[:20]
sns.barplot(x = cnt['unit_price_k'],y = cnt.index,ax = ax2 ,palette= 'viridis')
ax2.set_title('Expensive parts for cars')


cnt = retail_df.groupby(['soldto_country']).sum()['unit_price_k'].sort_values(ascending = False).to_frame()[:20]
sns.barplot(x = cnt['unit_price_k'],y = cnt.index,ax = ax3 ,palette= 'summer')
ax3.set_title('Maximum sales per country')

cnt = retail_df.groupby(['plant_city']).sum()['unit_price_k'].sort_values(ascending = False).to_frame()[:20]
sns.barplot(x = cnt['unit_price_k'],y = cnt.index,ax = ax4 ,palette= 'autumn')
ax4.set_title('Plant city that manufactures the high profit sales')
plt.subplots_adjust(wspace = 0.8,hspace=0.3)
In [38]:
fig,ax = plt.subplots(4,1,figsize = (7,25))
ax1,ax2,ax3,ax4 = ax.flatten()

# cnt = retail_df.groupby(['colour']).sum()['no_units'].sort_values(ascending = False).to_frame()[:10]
# sns.barplot(x = cnt['no_units'],y = cnt.index,ax = ax1 ,palette= 'winter')
# ax1.set_title('Number of units sold according to colour of the car')

cnt = retail_df.groupby(['shipto_country']).sum()['no_units'].sort_values(ascending = False).to_frame()[:15]
sns.barplot(x = cnt['no_units'],y = cnt.index,ax = ax1 ,palette= 'inferno')
ax1.set_title('Number of units shipped per coutry')


cnt = retail_df.groupby(['most_expensive_part_l1']).sum()['no_units'].sort_values(ascending = False).to_frame()[:20]
sns.barplot(x = cnt['no_units'],y = cnt.index,ax = ax2 ,palette= 'viridis')
ax2.set_title('Units sold with most desired expensive parts')


cnt = retail_df.groupby(['soldto_country']).sum()['no_units'].sort_values(ascending = False).to_frame()[:10]
sns.barplot(x = cnt['no_units'],y = cnt.index,ax = ax3 ,palette= 'summer')
ax3.set_title('Maximum units sold per country')

cnt = retail_df.groupby(['plant_city']).sum()['no_units'].sort_values(ascending = False).to_frame()[:10]
sns.barplot(x = cnt['no_units'],y = cnt.index,ax = ax4 ,palette= 'autumn')
ax4.set_title('City that manufactures the highest number of units')
plt.subplots_adjust(wspace = 0.8,hspace=0.3)
In [39]:
plt.figure(figsize = (8,5))
price_item_class_l2 = retail_df.groupby(['cust_group_name']).sum()['unit_price_k'].sort_values(ascending = False).to_frame()[:20]
sns.barplot(x = 'unit_price_k', y = price_item_class_l2.index,data = price_item_class_l2,palette='spring')
plt.title('')
plt.xlabel('unit_price_k',fontsize = 15)
plt.ylabel('Customer group name',fontsize = 10,rotation = 'vertical')
Out[39]:
Text(0, 0.5, 'Customer group name')
In [40]:
plt.figure(figsize = (8,3))
price_item_class_l2 = retail_df.groupby(['cust_segment_name']).sum()['unit_price_k'].sort_values(ascending = False).to_frame()[:20]
sns.barplot(x = 'unit_price_k', y = price_item_class_l2.index,data = price_item_class_l2,palette='spring')
plt.title('Revenue per customer group',fontsize = 15);
plt.xlabel('unit price($000)',fontsize = 15)
plt.ylabel('Customer group name',fontsize = 15,rotation = 'vertical');
In [41]:
plt.figure(figsize = (12,5))

cnt = retail_df.groupby(['item_class_l2']).mean()['unit_price_k'].sort_values(ascending = False).to_frame()[:50]
sns.barplot(x = cnt['unit_price_k'],y = cnt.index,palette= 'winter')
plt.title("Revenue per brand")
plt.xticks(rotation = 45);

Bugatti seems to make the maximum profit¶

In [42]:
plt.figure(figsize = (12,8))
cnt = retail_df.groupby(['item_class_l3']).mean()['unit_price_k'].sort_values(ascending = False).to_frame()[:50]
sns.barplot(x = cnt['unit_price_k'],y = cnt.index,palette=  'twilight_shifted')
ax2.set_title('Revenue per model')
#plt.yticks(rotation = 45)
plt.subplots_adjust(wspace = 0.8)

Looks like we have data with negative values, presuming it to be loss making and making no sales, lets leave out these values for our analysis¶

In [43]:
retail_df_neg = retail_df.loc[retail_df['unit_price_k'] > 0]

Lets understand brand v/s units sold¶

  • Lets have a look at more granular level with respect to the brand model item_class_l3
In [44]:
sales_df = retail_df_neg.groupby(['item_class_l3','cust_segment_name']).agg({'unit_price_k':'mean','no_units':'mean'}).\
            sort_values(by = "no_units",ascending=False)
In [45]:
sales_df.reset_index(inplace = True)
sales_df
Out[45]:
item_class_l3 cust_segment_name unit_price_k no_units
0 Bentley Excellent 521.335513 14043.668618
1 Pagani1 More 1176.169780 11380.355890
2 Ferrari Prime 1397.517853 9091.578461
3 Aspark11 Excellent 1454.371525 8874.576271
4 Bentley Star 539.645315 7718.549747
... ... ... ... ...
93 Pagani2 Excellent 6746.520000 28.233333
94 Rolls-Royce2 Prime 1980.000000 22.000000
95 Rolls-Royce4 Prime 704.000000 22.000000
96 Buggati Excellent 4400.000000 22.000000
97 Aston Martin4 Star 3740.000000 4.400000

98 rows × 4 columns

In [46]:
px.scatter(data_frame = sales_df,x = 'unit_price_k',y = 'no_units',hover_data=['cust_segment_name'],color = 'item_class_l3')
In [47]:
sales_grp_cust_df = retail_df_neg.groupby(['item_class_l3','cust_group_name']).agg({'unit_price_k':'sum','no_units':'sum'}).\
            sort_values(by = "unit_price_k",ascending=False)
sales_grp_cust_df.reset_index(inplace = True)
In [48]:
px.scatter(data_frame = sales_grp_cust_df,x = 'unit_price_k',y = 'no_units',hover_data=['cust_group_name'],color = 'item_class_l3')

Considering the larger picture, brands that make the most profit and most number of sales¶

In [49]:
sales_l2_df = retail_df_neg.groupby(['item_class_l2']).agg({'unit_price_k':'sum','no_units':'sum'}).\
            sort_values(by = "no_units",ascending=False)
sales_l2_df.reset_index(inplace = True)
In [50]:
sales_l2_df
Out[50]:
item_class_l2 unit_price_k no_units
0 Bentley 1.626001e+06 2.236847e+07
1 Lotus Cars 3.809138e+06 1.621270e+07
2 Ferrari 3.228916e+06 1.399056e+07
3 Aspark 4.503512e+06 6.433299e+06
4 Pagani 5.386945e+06 5.726159e+06
5 W Motors 3.304286e+06 4.008274e+06
6 McLaren Automotive 1.143466e+06 3.226757e+06
7 Rolls-Royce 4.705128e+06 2.595787e+06
8 Aston Martin 1.579900e+06 1.389890e+06
9 Bugatti 2.310537e+06 9.569140e+05
10 Mercedes-Benz 1.054640e+06 5.648836e+05
11 Lamborghini 3.967731e+05 3.513853e+05
12 Rimac Automobili 2.694167e+05 2.455898e+05
13 Koenigsegg 2.211000e+03 2.200000e+03
In [51]:
px.scatter(data_frame = sales_l2_df,x = 'unit_price_k',y = 'no_units',hover_data=['item_class_l2'],color = 'item_class_l2',)

Insights¶

  • From the above 2 graphs, we can conclude that the highest sales making Brand is Bentley(crowd favourite) and maximum profit making brand is Pagani.
  • Brands like Koenigsegg, Rimac Automobili are barely making any sales thus not generating revenue.
  • Mid sized brands like
In [52]:
retail_df_neg.shape
Out[52]:
(19432, 28)

Based on the above graphs lets have a look at the brands that have high sales¶

In [53]:
high_sales_brand_df = retail_df_neg.loc[(retail_df_neg['item_class_l2'] == 'Bentley') | (retail_df_neg['item_class_l2'] == 'Lotus Cars') \
                                    | (retail_df_neg['item_class_l2'] == 'Ferrari') | (retail_df_neg['item_class_l2'] == 'Aspark')]
In [54]:
high_sales_brand_df.reset_index(inplace = True)
In [55]:
high_sales_brand_df.groupby('item_class_l2').agg({'unit_price_k':'mean'}).sort_values(by = "unit_price_k",ascending=False)
Out[55]:
unit_price_k
item_class_l2
Aspark 2055.459627
Ferrari 1496.253948
Lotus Cars 998.725128
Bentley 539.840945
Aspark has the maximum profit observed¶

Feature engineering ¶

One hot encoding and label encoding¶

In [56]:
catCols
for column in catCols:
     retail_df[column] = pd.factorize(retail_df[column])[0]

Recent transactions show higher correlation , observe rolling_mean_t7 with unit_price_k¶

In [57]:
retail_df = retail_df.drop_duplicates()
In [58]:
def date_features(retail_df):
    
    retail_df["date_of_sale"] = pd.to_datetime(retail_df["date_of_sale"])
    retail_df["day"] = retail_df.date.dt.day
    retail_df["month"] = retail_df.date.dt.month
    retail_df["week_day"] = retail_df.date.dt.weekday

    retail_df.drop(columns="date", inplace=True)

    return df
def recent_transaction_features(retail_df):

    retail_df["lag_t28"] = retail_df["unit_price_k"].transform(lambda x: x.shift(28))
    retail_df["rolling_mean_t7"] = retail_df["unit_price_k"].transform(lambda x:x.shift(28).rolling(7).mean())
    retail_df['rolling_mean_t30'] = retail_df['unit_price_k'].transform(lambda x: x.shift(28).rolling(30).mean())
    retail_df['rolling_mean_t60'] = retail_df['unit_price_k'].transform(lambda x: x.shift(28).rolling(60).mean())
    retail_df['rolling_mean_t90'] = retail_df['unit_price_k'].transform(lambda x: x.shift(28).rolling(90).mean())
    retail_df['rolling_mean_t180'] = retail_df['unit_price_k'].transform(lambda x: x.shift(28).rolling(180).mean())

    retail_df.fillna(0, inplace=True)

    return retail_df

retail_df = recent_transaction_features(retail_df)
In [59]:
#correlation matrix
corrmat = retail_df.corr()
f, ax = plt.subplots(figsize=(15, 15))
sns.heatmap(corrmat, vmax=.8,square=False);

Save data for model building¶

In [60]:
retail_df.to_csv("train.csv")

Customer segmentation ¶

Lets look at how pricing depends on the recency, frequency and monetory values¶

In [61]:
retail_df['date_of_sale'].max()
Out[61]:
Timestamp('2161-05-21 00:00:00')
Lets consider the next day as the reference and caluculate the recency and frequency of transactions based on customer_id¶
In [62]:
now = dt.datetime(2161,5,22)
In [63]:
rfm = retail_df.groupby('customer_id').agg({'date_of_sale' : lambda day : (now - day.max()).days,
                               'customer_id': lambda num : len(num),
                              'unit_price_k': lambda price : price.sum()
                             
                             
                             })
col_list = ['Recency','Frequency','Monetary']
rfm.columns = col_list
In [64]:
rfm["R"] = pd.qcut(rfm["Recency"],5,labels=[5,4,3,2,1])
rfm["F"] = pd.qcut(rfm["Frequency"],5,labels=[1,2,3,4,5])
rfm["M"] = pd.qcut(rfm["Monetary"],5,labels=[1,2,3,4,5])
rfm["RFM_Score"] = rfm["R"].astype(int) +rfm["F"].astype(int) + rfm["M"].astype(int)
In [65]:
seg_map = {
  
    r'[1-3][1-2][1-3]': 'Low Engagement & Low Value ',
    r'[1-3][2-5][1-5]': 'Old Activity & High Frequency',
    r'[4-5][1-3][1-5]': 'Recent Activity & Low Frequency',
    r'[1-5][3-4][1-3]': 'High Engagement & Low Value',
    r'[4-5][4-5][3-5]': 'High engagement and high value'
}
In [66]:
rfm['Segment'] = rfm['R'].astype(str) + rfm['F'].astype(str) + rfm['M'].astype(str)
rfm['Segment'] = rfm['Segment'].replace(seg_map, regex=True)
rfm.head()
Out[66]:
Recency Frequency Monetary R F M RFM_Score Segment
customer_id
0 52 31 49956.940000 3 4 4 11 Old Activity & High Frequency
1 5 50 84452.534171 5 4 4 13 High engagement and high value
2 193 5 2497.660000 2 2 1 5 Low Engagement & Low Value
3 774 1 1540.000000 1 1 1 3 Low Engagement & Low Value
4 960 1 1870.000000 1 1 1 3 Low Engagement & Low Value
In [67]:
rfm.groupby('Segment').mean().sort_values('Monetary',ascending=False)
Out[67]:
Recency Frequency Monetary RFM_Score
Segment
High engagement and high value 6.622642 209.188679 296625.338121 13.716981
Old Activity & High Frequency 208.708333 38.229167 73879.013049 9.375000
Recent Activity & Low Frequency 10.032258 8.483871 23777.369975 8.967742
High Engagement & Low Value 6.000000 43.400000 23714.504000 11.400000
Low Engagement & Low Value 409.376812 2.550725 6430.792724 4.956522
In [68]:
rfm.groupby('Segment').mean().sort_values('Recency',ascending=True)
Out[68]:
Recency Frequency Monetary RFM_Score
Segment
High Engagement & Low Value 6.000000 43.400000 23714.504000 11.400000
High engagement and high value 6.622642 209.188679 296625.338121 13.716981
Recent Activity & Low Frequency 10.032258 8.483871 23777.369975 8.967742
Old Activity & High Frequency 208.708333 38.229167 73879.013049 9.375000
Low Engagement & Low Value 409.376812 2.550725 6430.792724 4.956522